Efficient Dynamic Dictionary Matching with DAWGs and AC-automata

Authors

  • Diptarama
  • Shunsuke Inenaga
  • Ryo Yoshinaka
  • Ayumi Shinohara
Abstract

Dictionary matching is the task of finding all occurrences of the pattern strings of a set D (called a dictionary) in a text string T. The Aho-Corasick automaton (AC-automaton) built on D is a fundamental data structure that solves the dictionary matching problem in O(d log σ) preprocessing time and O(n log σ + occ) matching time, where d is the total length of the patterns in the dictionary D, n is the length of the text, σ is the alphabet size, and occ is the total number of occurrences of all the patterns in the text. Dynamic dictionary matching is a variant in which patterns may be dynamically inserted into and deleted from the dictionary D. The problem is called semi-dynamic dictionary matching if only insertions are allowed. In this paper, we propose two efficient algorithms that can solve both problems with some modifications. For a pattern of length m, our first algorithm supports insertions in O(m log σ + log d / log log d) time and pattern matching in O(n log σ + occ) time in the semi-dynamic setting. With some modifications, this algorithm also supports both insertions and deletions in O(σm + log d / log log d) time and pattern matching in O(n(log d / log log d + log σ) + occ(log d / log log d)) time for the dynamic dictionary matching problem. This algorithm is based on the directed acyclic word graph (DAWG) of Blumer et al. (JACM 1987). Our second algorithm, which is based on the AC-automaton, supports insertions in O(m log σ + u_f + u_o) time in the semi-dynamic setting and supports both insertions and deletions in O(σm + u_f + u_o) time in the dynamic setting, where u_f and u_o respectively denote the numbers of states whose failure function and output function need to be updated. This algorithm performs pattern matching in O(n log σ + occ) time in both settings. Our algorithm achieves optimal update time for AC-automaton based methods, since any algorithm that explicitly maintains the AC-automaton requires Ω(u_f + u_o) update time.
Keywords: dynamic dictionary matching, AC-automaton, DAWG
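
As a point of reference for the static problem described in the abstract, the following is a minimal sketch of a textbook AC-automaton with goto, failure, and output functions. It is not the paper's dynamic algorithm: it builds the automaton once for a fixed dictionary and supports no insertions or deletions. The class name `ACAutomaton`, the method `find_all`, and the `out_link` shortcut array are illustrative choices, and transitions are stored in hash maps (expected constant-time lookup) rather than in the ordered structure behind the O(log σ) factor. Patterns are assumed to be non-empty.

```python
from collections import deque


class ACAutomaton:
    """Static Aho-Corasick automaton for a fixed dictionary (illustrative sketch)."""

    def __init__(self, patterns):
        self.patterns = list(patterns)   # assumed non-empty strings
        self.goto = [{}]                 # goto[s][c] -> child of state s in the trie
        self.fail = [0]                  # failure link of each state
        self.out = [[]]                  # indices of patterns ending exactly at a state
        self.out_link = [0]              # nearest suffix state with output (0 = none)
        for idx, pattern in enumerate(self.patterns):
            self._insert(pattern, idx)
        self._build_links()

    def _insert(self, pattern, idx):
        # Walk the trie from the root, creating states as needed.
        s = 0
        for c in pattern:
            if c not in self.goto[s]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.out_link.append(0)
                self.goto[s][c] = len(self.goto) - 1
            s = self.goto[s][c]
        self.out[s].append(idx)

    def _build_links(self):
        # Breadth-first traversal: depth-1 states keep fail = root.
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for c, t in self.goto[s].items():
                f = self.fail[s]
                while f and c not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(c, 0)
                # fail[t] is shallower than t, so its out_link is already final.
                ft = self.fail[t]
                self.out_link[t] = ft if self.out[ft] else self.out_link[ft]
                queue.append(t)

    def find_all(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        hits = []
        s = 0
        for i, c in enumerate(text):
            while s and c not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(c, 0)
            # Report s itself and every suffix state that carries output.
            t = s if self.out[s] else self.out_link[s]
            while t:
                for idx in self.out[t]:
                    p = self.patterns[idx]
                    hits.append((i - len(p) + 1, p))
                t = self.out_link[t]
        return hits


if __name__ == "__main__":
    ac = ACAutomaton(["he", "she", "his", "hers"])
    print(ac.find_all("ushers"))
```

In this sketch, adding a pattern means rebuilding every failure and output link from scratch. The quantities u_f and u_o in the abstract count only the states whose failure and output functions actually change, which is the cost the paper's second algorithm is stated to match.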


Similar resources


Approximate String Matching by Finite Automata

Abstract. Approximate string matching is a sequential problem and therefore it is possible to solve it using finite automata. A nondeterministic finite automaton is constructed for string matching with k mismatches. It is shown how "dynamic programming" and "shift-and" based algorithms simulate this nondeterministic finite automaton. The corresponding deterministic finite automaton have O...


Ternary Directed Acyclic Word Graphs

Given a set S of strings, a DFA accepting S offers a very time-efficient solution to the pattern matching problem over S. The key is how to implement such a DFA in the trade-off between time and space, and especially the choice of how to implement the transitions of each state is critical. Bentley and Sedgewick proposed an effective tree structure called ternary trees. The idea of ternary trees...


Text Disambiguation By Finite State Automata, An Algorithm And Experiments On Corpora

The exploration of the context should provide clues that eliminate the non-relevant solutions. For this purpose we use local grammar constraints represented by finite automata. We have designed and implemented an algorithm which performs this task by using a large variety of linguistic constraints. Both the texts and the rules (or constraints) are represented in the same formalism, that is fini...


Inexact Pattern Matching Algorithms via Automata

Pattern matching occurs in various applications, ranging from simple text searching in word processors to identification of common motifs in DNA sequences in computational biology. The problem of exact pattern matching has been well studied and a number of efficient algorithms exist. However these exact pattern matching algorithms are of little help when they are applied to finding patterns in ...



Journal:
  • CoRR

Volume abs/1710.03395

Publication date: 2017